DM Project - Deliverable 1

Imports

Preprocessing

First we load the dataset and examine its attributes and overall structure.

We wish to drop any rows which are duplicates.
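As a minimal sketch of these two steps, using a toy inline CSV standing in for the real file (the actual path and full column set differ):

```python
import io
import pandas as pd

# Toy stand-in for the real CSV; column names are assumptions for illustration.
csv_data = io.StringIO(
    "Year,Region,RoadCategory,AllMotorVehicles\n"
    "2000,London,PU,120\n"
    "2000,London,PU,120\n"  # exact duplicate row
    "2001,Wales,TM,80\n"
)
df = pd.read_csv(csv_data)

print(df.shape)            # (3, 4) -- structure before cleaning
df = df.drop_duplicates()  # remove exact duplicate rows
print(df.shape)            # (2, 4)
```

On the real data the same `drop_duplicates()` call applies unchanged; only the loading step differs.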

We want to drop either rows or columns containing NaN values. First we look at the number of NaNs in each attribute.

Assessing from the sums, only two attributes contain NaNs. Judging from the data in these attributes, we prefer dropping the columns 'Estimation method' and 'Estimation method detailed' entirely, because we will not be using the information they contain. Dropping rows would instead discard complete datapoints, which is not acceptable.
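A sketch of this decision on a toy frame (the two estimation column names are taken from the text; everything else is illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame: only the two estimation columns contain NaN.
df = pd.DataFrame({
    "Year": [2000, 2001, 2002],
    "AllMotorVehicles": [120, 130, 125],
    "Estimation method": ["Counted", np.nan, np.nan],
    "Estimation method detailed": [np.nan, "Estimated", np.nan],
})

print(df.isna().sum())  # NaN count per attribute

# Drop the two sparse columns rather than losing whole datapoints.
df = df.drop(columns=["Estimation method", "Estimation method detailed"])
```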

The final structure of our dataframe is as follows.

After assessing the unique values of each attribute, we can see an error in 'RoadCategory': it contains both 'PU' and 'Pu', and both 'TU' and 'Tu', as separate values. We need to fix this.
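One simple fix (a sketch, assuming the mixed-case values differ only in capitalisation) is to upper-case the whole column so the duplicate spellings collapse:

```python
import pandas as pd

# Toy column reproducing the mixed-case error from the text.
s = pd.Series(["PU", "Pu", "TU", "Tu", "TM"], name="RoadCategory")

# Upper-casing merges 'Pu' into 'PU' and 'Tu' into 'TU'.
s = s.str.upper()
print(sorted(s.unique()))  # ['PU', 'TM', 'TU']
```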

Correlation between attributes

First we wish to find all the categorical attributes.

Now we convert these categorical attributes to numerical codes so that we can compute a correlation matrix over them.
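A sketch of this encoding step on a toy frame (column names are illustrative; `pd.factorize` is one way to do the conversion, not necessarily the one used here):

```python
import pandas as pd

df = pd.DataFrame({
    "Region": ["London", "Wales", "London", "Scotland"],
    "RoadCategory": ["PU", "TM", "TU", "PU"],
    "AllMotorVehicles": [120, 80, 95, 60],
})

# Replace each categorical (object-dtype) column with integer codes
# so that df.corr() can include it.
for col in df.select_dtypes(include="object").columns:
    df[col] = pd.factorize(df[col])[0]

corr = df.corr()
print(corr.round(2))
```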

Correlation signifies how two variables move with each other. A positive correlation means that when one variable increases the other tends to increase, while a negative correlation means that when one increases the other tends to decrease.

The diagonal of the heat map consists of ones, since each variable is perfectly correlated with itself.

The above heat map shows the correlation between each pair of variables. We will focus only on the moderate and strong relations. A correlation is said to be moderate if its magnitude is between 0.40 and 0.59, strong between 0.60 and 0.79, and very strong between 0.80 and 1.00.
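These bands can be expressed as a small helper (a sketch; the boundary handling at 0.4, 0.6 and 0.8 is assumed inclusive on the lower end):

```python
def correlation_strength(r):
    """Label the magnitude of a correlation using the bands from the text."""
    m = abs(r)
    if m >= 0.8:
        return "very strong"
    if m >= 0.6:
        return "strong"
    if m >= 0.4:
        return "moderate"
    return "weak"

print(correlation_strength(0.45))   # moderate
print(correlation_strength(-0.85))  # very strong
```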

One insight is that the different vehicle types are positively correlated. For example, CarsTaxis and LightGoodsVehicles: increased CarsTaxis accidents on a road link in a given year came with increased LightGoodsVehicles accidents. This could give valuable insight into why the accidents increased. Most of the different vehicle types are strongly correlated with each other.

Some examples of categorized correlations are given below.

Moderate Correlations:

Strong Correlations:

Very Strong Correlations:

Yearly analysis

Category-wise analysis

Total accident distribution

Most of our accidents are dominated by the Cars/Taxis category, followed by LGVs and HGVs.

Visualisation 1: LGVs vs. HGVs

Comment: Light Goods Vehicles have shown consistently higher numbers than Heavy Goods Vehicles and are, on average, increasing over the years. Heavy Goods Vehicles show a stable, consistent number of accidents over the years.

Visualisation 2: Motor vehicles sub-categories plot

Comment: Accidents for cars and taxis are consistently very high over the years (compared to the other two categories, motorcycles and buses/coaches). Since these three are not on comparable scales, we will compare motorcycles with buses/coaches separately.

Comment: Accidents for buses and coaches fluctuate around 4,000,000 for most years, while motorcycle accidents show a decreasing trend from 2007 onwards. This is interesting because it may point to a cause (perhaps a mandatory helmet policy or a separate motorcycle lane) that could be inferred from later analyses.

Comment: Accidents for cars and taxis are consistently very high over the years (compared to all other categories). Since the remaining categories are not on a comparable scale with cars and taxis, we will remove cars and taxis from the graph.

Comment: We can now see the trend of each of these categories. LGV accidents have been increasing on average and are consistently the highest among these categories over the years. Buses/coaches and motorcycles follow very similar, stable trends at levels below the HGV accidents. HGV accidents have decreased slightly over the years.

Analysis according to road type

The pie chart above shows the share of accidents per road type. The top three shares belong to the following road types: TM, PM, and TU.
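The quantities behind such a pie chart can be computed as below (a sketch on toy data; column names are assumptions):

```python
import pandas as pd

# Toy rows: road category and accident count per road link.
df = pd.DataFrame({
    "RoadCategory": ["TM", "PM", "TU", "TM", "PU"],
    "AllMotorVehicles": [500, 300, 250, 400, 50],
})

# Share of total accidents per road type -- the slices of the pie chart.
shares = df.groupby("RoadCategory")["AllMotorVehicles"].sum()
shares = shares / shares.sum()
print(shares.sort_values(ascending=False))
```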

Visualization of dataset using maps

Inspiration taken from: https://medium.com/technology-hits/working-with-maps-in-python-with-mapbox-and-plotly-6f454522ccdd

Visualising accident locations for all motor vehicles

The heatmap over the UK map shows approximately 50,000 accidents spread across all road links. On closer inspection, the frequency of accidents increases at the outskirts of major cities.

Visualising most frequent accident locations for all motor vehicles

To see the most frequent accident road links, we limited our dataframe to rows with more than 100,000 'AllMotorVehicles' accidents. Judging from the heatmap above, road links near the following cities are most susceptible to motor vehicle accidents: London, Cardiff, Southampton, Nottingham, Manchester, Birmingham, Sheffield, and Glasgow.
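The filtering step before plotting is a single boolean mask (a sketch; the road link identifiers are made up for illustration):

```python
import pandas as pd

# Toy rows: road link and its total motor vehicle accident count.
df = pd.DataFrame({
    "RoadLink": ["A1", "M25", "B3", "M6"],
    "AllMotorVehicles": [40_000, 250_000, 5_000, 120_000],
})

# Keep only road links whose count exceeds the 100,000 cut-off.
frequent = df[df["AllMotorVehicles"] > 100_000]
print(frequent["RoadLink"].tolist())  # ['M25', 'M6']
```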

Visualising accident locations for non-motorised vehicles (pedal cycles)

The heatmap for pedal cycles resembles the one showing the most frequent accident locations: these accidents also cluster near the major cities of the United Kingdom.

DM Project - Deliverable 2 - Group 22

Data Loading and Preprocessing

Frequent Pattern Mining

In this part of the project we had to find the frequency of accidents that happen in a given region and year. We used the apriori algorithm learnt in class, via the efficient_apriori library, which we found to be the most efficient during our assignment.

First, the relevant features were separated from the rest of the dataset.

The apriori library takes a list of tuples, so we needed to reshape our dataset accordingly. We made a (region, year) tuple for each row and repeated it as many times as the number of accidents in that row, so that the frequent pattern mining sees one transaction per accident.

Since the total number of accidents is large, we could not create one object per accident without running out of memory. As an alternative, we scaled the counts down by dividing them by 100, so throughout the next few parts accident counts are in hundreds.
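The construction described above can be sketched in plain Python (toy rows; integer division by 100 is an assumption about how the scaling was applied):

```python
# Toy (Region, Year, accident count) rows standing in for the dataframe.
rows = [
    ("London", 2005, 350),
    ("Wales", 2005, 120),
]

# One (region, year) transaction per 100 accidents, to keep memory manageable.
transactions = []
for region, year, count in rows:
    transactions.extend([(region, year)] * (count // 100))

print(len(transactions))  # 3 + 1 = 4
```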

We run apriori with a low min_support, since increasing it would leave too few itemsets.

Total number of tuples = 116,775,398 (in hundreds)

min_support = 0.001

Threshold = 0.001 × 116,775,398 ≈ 116,776 (in hundreds)

So our threshold is 116,776: any pattern that appears at least this many times is carried forward. We used a low min_support so that a sufficient number of itemsets survive to show the trends.
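To make the threshold logic concrete, here is a pure-Python sketch of the first pass of apriori (1-itemset support counting) on toy transactions; the real run uses efficient_apriori on the full scaled dataset:

```python
from collections import Counter

# Toy transactions; in the real run these are the scaled (region, year) tuples.
transactions = [("London", 2005)] * 5 + [("Wales", 2006)] * 1

min_support = 0.3
threshold = min_support * len(transactions)  # absolute count cut-off

# Count the support of each individual item -- apriori's first pass.
counts = Counter()
for t in transactions:
    for item in t:
        counts[item] += 1

# Items meeting the threshold are carried forward to build larger itemsets.
frequent = {item for item, c in counts.items() if c >= threshold}
print(sorted(frequent, key=str))  # [2005, 'London']
```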

These are the frequent patterns present in our dataset.

The following cells process the data before plotting.

The graph below shows the trends of accidents throughout the years in each region. It gives insight into where accidents have been increasing and decreasing.

These insights can be crucial in finding what the cause of the accidents are and how to decrease them.

We also did frequent pattern mining for all different types of vehicles and their respective number of accidents.

For this we needed to make tuples of (Year, Region, VehicleType). Once again we would run out of memory, so we divided the count of each tuple by 100 before running apriori on it.

The tuples below show when the most frequent accidents happened, where, and which type of vehicle they involved. The counts need to be multiplied by 100, since they were scaled down.

From the above table we can read off the accident statistics. We can change min_support to filter out more or fewer frequent itemsets.

Clustering

In this part we cluster the regions according to their accident trends throughout the years, using KMeans.

Now we have regions and the number of accidents that happen across the years.

We now cluster our data: first we normalise it and then cluster it, using make_pipeline from scikit-learn as in our assignment. We decided on 5 clusters for the regions to fit in.
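A minimal sketch of this pipeline on a random toy matrix (one row per region, one column per year; the use of `Normalizer` for the normalisation step is an assumption):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Toy matrix: 9 regions x 5 years of accident counts.
rng = np.random.default_rng(0)
X = rng.integers(50, 500, size=(9, 5)).astype(float)

# Normalise each region's yearly series, then cluster into 5 groups.
pipe = make_pipeline(Normalizer(), KMeans(n_clusters=5, random_state=0, n_init=10))
labels = pipe.fit_predict(X)
print(labels)  # cluster index per region
```

On the real data each row would be a region's accident counts across all years, so regions with similarly shaped trends end up in the same cluster.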

The above graph shows which region falls in which cluster. Regions that fall in the same cluster have similar trends.

For example, the trends in the East of England and the West Midlands are very similar, hence they fall in the same cluster. This can be checked against the visualisation from the Frequent Pattern Mining graph.

Meanwhile, London is the only region with a decreasing accidents trend and hence it falls in a cluster alone since no other region follows that trend.

We can use these clusters to determine which regions have similar trends and then see what those regions have in common. Experts can use this information to work on ways to reduce accidents in the future.

Throughout our analysis we used AllMotorVehicles accidents, so that we capture the overall trends in accidents rather than limiting them to a single vehicle type.